Comparative analysis of protein structure using multiscale additive functionals 
This work reports a new methodology for describing the characteristics of protein structural
shapes and proposes a framework for automatically classifying such structures
into known families. The approach is based on elements
of integral geometry: biologically relevant measurements of shape are evaluated on
a multi-scale representation, which aligns the methodology with the recently reported tube
picture of protein structure as a minimal representation model. The method has been applied
with good results to a subset of protein structures known to be especially challenging to resolve into
families, confirming its potential for accurate structure classification.
I. INTRODUCTION



Evolution has produced a huge number of protein fam- 
ilies and super-families whose members possess similar 
sequences and three-dimensional structures. Restraints 
on evolutionary divergence are mainly related to the 
protein function, and therefore selective pressure tends 
to operate on the three-dimensional structure [1].
HOMSTRAD is an example of a database of protein
structures organized into homologous families. As
a consequence of the global proteomic effort, the num- 
ber of known structures is growing at an impressive rate 
and has passed the total of 39000 structures. It is re- 
markable progress but, on the other hand, it also in- 
troduces an overwhelming amount of data to be man- 
ually classified on those databases. With more than 400 
structures solved every month, the challenge for auto- 
matic protein structural comparison and classification is 
greater than ever. Most protein comparison methods
depend mainly upon structural alignment and RMSD
measures, and are therefore not completely reliable [3].
While RMSD is a good measure of structural similarity
for almost identical proteins, it cannot be used to judge
dissimilarity since it violates the triangle inequality. This
means that any system based on RMSD alone is unable
to cluster structures and is consequently incapable of
classifying them into families. In addition, the reliance on
sequence alignments introduces a drawback because it is 
virtually impossible to avoid errors during the alignment 
construction. 

In this paper we investigate the potential of an algorithm
adapted to automatically classify proteins into
HOMSTRAD families. This algorithm is based on concepts
of Integral Geometry [4], known as Morphological
Image Analysis (MIA), which has recently been applied
to a series of problems owing to its simplicity of design and
implementation. Fields as diverse as Neuroscience [5] and
Materials Science [6] have benefited from this approach.



II. ADDITIVE SHAPE FUNCTIONALS 

We start by describing the mathematical aspects of
the adopted procedure. The Minkowski functionals of
a body $K$ in the plane are proportional to the familiar
geometric quantities of area $A(K)$, perimeter $U(K)$
and the connectivity or Euler number $\chi(K)$. The usual
definition of the connectivity from algebraic topology in
two dimensions is the difference between the number of
connected components $n_c$ and the number of holes $n_h$,
$\chi(K) = n_c - n_h$. In three-dimensional Euclidean space, there are two
kinds of holes to consider. First, we have the pure hole,
a completely closed region of white voxels surrounded by
black voxels. Second, there are tunnels. The Euler characteristic
is consequently given as $\chi(K) = n_c - n_t + n_h$,
where $n_t$ is the number of tunnels and $n_h$ is the number
of pure holes. There is an additional geometric quantity
to consider in three-dimensional space, namely
the mean curvature or breadth $B(K)$. By exploiting the
additivity of the Minkowski functionals, their determination
reduces to counting the multiplicity of basic building
blocks that disjointly compose the object. For example,
a voxel can be decomposed into a disjoint set of 8 vertices,
12 edges, 6 faces and one open cube. The same
process can be applied to any object in a lattice. For a 
three-dimensional space, which is our interest regarding 
protein structures, see @, @, we have 

$$
\begin{aligned}
V(P) &= n_3, & S(P) &= -6n_3 + 2n_2, \\
2B(P) &= 3n_3 - 2n_2 + n_1, & \chi(P) &= -n_3 + n_2 - n_1 + n_0,
\end{aligned}
\tag{1}
$$

where $n_3$ is the number of interior cubes, $n_2$ is the number
of open faces, $n_1$ is the number of edges and $n_0$ is
the number of vertices. So, the procedure to calculate the
Minkowski functionals of a pattern $P$ reduces to
counting the number of elementary bodies of each type
(cubes, faces, edges and vertices) that compose the voxels
belonging to $P$.
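The counting procedure behind Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `minkowski_functionals` and the representation of a pattern as a set of integer voxel coordinates are assumptions made for this example. Shared faces, edges and vertices are counted once by storing every elementary body of every voxel in a set, indexed on a doubled lattice.

```python
from itertools import product


def minkowski_functionals(voxels):
    """Return (V, S, 2B, chi) for an iterable of integer (x, y, z) voxels.

    Hypothetical helper: it counts the open cubes, faces, edges and
    vertices of the pattern and applies Eq. (1).
    """
    cubes = set(voxels)
    faces, edges, vertices = set(), set(), set()
    for x, y, z in cubes:
        # Doubled lattice: the cube centre has odd coordinates; each element
        # of the closed cube is the centre plus an offset in {-1, 0, 1}^3.
        centre = (2 * x + 1, 2 * y + 1, 2 * z + 1)
        for offset in product((-1, 0, 1), repeat=3):
            cell = tuple(c + d for c, d in zip(centre, offset))
            moved = sum(1 for d in offset if d != 0)
            if moved == 1:
                faces.add(cell)      # shifted along one axis: a face
            elif moved == 2:
                edges.add(cell)      # shifted along two axes: an edge
            elif moved == 3:
                vertices.add(cell)   # shifted along three axes: a vertex
    n3, n2, n1, n0 = len(cubes), len(faces), len(edges), len(vertices)
    volume = n3
    surface = -6 * n3 + 2 * n2
    twice_breadth = 3 * n3 - 2 * n2 + n1
    euler = -n3 + n2 - n1 + n0
    return volume, surface, twice_breadth, euler
```

For a single voxel ($n_3 = 1$, $n_2 = 6$, $n_1 = 12$, $n_0 = 8$) this gives $V = 1$, $S = 6$, $2B = 3$ and $\chi = 1$. A ring of eight voxels around an empty centre gives $\chi = 0$, consistent with one component and one tunnel.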



III. PROTEIN STRUCTURE, TUBE PICTURE 
AND MULTISCALE SIGNATURES 



The protein structure in our approach is defined essentially
by the geometrical/topological nature of its backbone.
All α-carbon atom coordinates are read from
a .pdb file and an interpolation scheme is used to connect
neighboring atoms by a straight path. This design procedure
attaches a variable resolution to the method, as
the highly refined atomic-scale data have to be truncated
during the process.
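The backbone construction described above can be sketched in Python as follows. The function names `read_ca_coordinates` and `voxelize_backbone` and the `voxel_size` parameter are hypothetical, and the interpolation shown (sampling each straight segment at half-voxel steps) is only one possible scheme, since the text does not specify the one actually used. The parser relies on the fixed-column layout of PDB ATOM records and, for simplicity, ignores chains, models and alternate locations.

```python
import math


def read_ca_coordinates(pdb_lines):
    """Collect (x, y, z) of every alpha-carbon from PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Atom name is in columns 13-16; x, y, z are in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def voxelize_backbone(coords, voxel_size=1.0):
    """Return the set of lattice voxels crossed by the C-alpha trace.

    Assumption: consecutive atoms are joined by a straight segment that is
    sampled at steps of at most half a voxel, so thin gaps are unlikely.
    """
    def to_voxel(point):
        return tuple(math.floor(c / voxel_size) for c in point)

    voxels = {to_voxel(p) for p in coords}
    for p, q in zip(coords, coords[1:]):
        steps = max(1, math.ceil(2 * math.dist(p, q) / voxel_size))
        for k in range(steps + 1):
            t = k / steps
            voxels.add(to_voxel([a + t * (b - a) for a, b in zip(p, q)]))
    return voxels
```

A larger `voxel_size` coarsens the lattice, which is one way to read the variable resolution mentioned in the text.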

In our analysis the calculation of the Minkowski functionals
is incorporated into a multiscale framework. In
such a scheme, all four quantities are calculated as a func- 
tion of a control parameter as some transformation is 
made on the structure of interest. In this paper we con- 
sider this transformation to be the process of exact dila- 
tions and the control parameter the dilation radius. Our 
choice is particularly suited as the exact dilation proce- 
dure naturally fits itself in what has been described as 
the tube picture for protein structure analysis Q , a mini- 
malist biophysical reasoning of the protein model. While 
the intricate aspects of the geometry/topology are ac- 
counted for at each spatial scale by the Minkowski func- 
tionals, the space surrounding the backbone is probed by 
performing the dilation of the structure and this informa- 
tion is condensed in what we call henceforth multiscale 
signatures. The behavior of such signatures, particularly 
the topologically related ones, can be discontinuous. For 
example the process of dilation may change abruptly the 
number of pure holes or tunnels at particular scales and 
these facts are registered for all scales in the multiscale 
signature for the connectivity or Euler number (charac- 
teristic). 
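A multiscale signature of this kind can be sketched in Python under a few stated assumptions. Here "exact dilation" is taken to mean dilation by Euclidean balls whose radii are restricted to the distances that actually occur between lattice points. The names `exact_radii`, `dilate` and `multiscale_signature` are hypothetical, and the functional is passed in as an arbitrary callable so that the sketch does not depend on any particular implementation of the Minkowski functionals.

```python
import math
from itertools import product


def exact_radii(r_max):
    """Sorted distances between lattice points that do not exceed r_max."""
    n = int(math.floor(r_max))
    squares = {i * i + j * j + k * k
               for i, j, k in product(range(n + 1), repeat=3)}
    return [math.sqrt(s) for s in sorted(squares) if s <= r_max * r_max]


def dilate(voxels, radius):
    """All lattice points within Euclidean distance `radius` of the set."""
    n = int(math.floor(radius))
    # The small tolerance absorbs rounding in radius * radius.
    ball = [(i, j, k)
            for i, j, k in product(range(-n, n + 1), repeat=3)
            if i * i + j * j + k * k <= radius * radius + 1e-9]
    return {(x + i, y + j, z + k)
            for (x, y, z) in voxels
            for (i, j, k) in ball}


def multiscale_signature(voxels, radii, functional):
    """Evaluate `functional` on the dilated structure at every radius."""
    return [(r, functional(dilate(voxels, r))) for r in radii]
```

With `functional=len` the signature is simply the volume as a function of the dilation radius. Passing a function that returns the Euler number instead would record the abrupt changes in holes and tunnels described above.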



IV. RESULTS AND DISCUSSION 

Figure 1 shows the signatures of all considered functionals
for a set of 71 proteins which were chosen specifically
because of their similarities. The range of scales shown
in these graphs encompasses the initial structure and the
final filled volume without holes and tunnels ($\chi = 1$).
There are both similarities and striking differences whose 
subtleties, until now, have been handled only by more 
complex algorithms. 

For each of the signatures in Figure 1 we select three
features in an attempt to globally characterize the structure
and, by doing so, minimize the amount of data
needed for future classification based on Minkowski functionals.
For the signatures of Area and Perimeter, we
evaluated the standard deviation, the integral of the curve, and the
scale at which the integral of the curve reaches half of
the total value. For the signatures of the Connectivity
and of the Mean Breadth we measured the standard deviation,
the integral of the curve and the monotonicity
index given by $i = (i_s + i_d + i_p)/i_s$, where $i_{s,d,p}$ are the
counts of the times the curve increases, decreases or stays
constant. Table I shows the numeric results obtained
by classical discriminant analysis § based on the twelve
above global measures and quantifies the classification
potential of the proposed framework. Such a discriminant
analysis projects the measurements in such a way as
to optimize their separation, expressed in terms of high
interclass and low intraclass dispersions. It is remarkable
that, although the structures were specially chosen
to make a reduction into families difficult, this approach
managed to perfectly classify four out of the five families.
A mistake was made in class C, where it misclassified 1 out
of 13 structures. It is worthwhile to note that, although
exhibiting different folds, alpha plus beta in class
C and all alpha in class E, their average length and
topological properties in general are quite similar. Figure 2
shows a two-dimensional section of the complete
feature space defined by measures from the mean breadth
and connectivity only. It provides a more economical
discriminating clustering, albeit with overlaps.

TABLE I: The result of a classical discriminant analysis for
the 12 features extracted from the multiscale signatures.

FIG. 2: A scatter plot derived from the mean breadth and
the perimeter alone leads to a discriminative feature space.
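The global features extracted from each signature (standard deviation, integral, half-integral scale and monotonicity index) can be sketched in Python as follows. Several details are assumptions, because the text does not fix them: the standard deviation is the population form, the integral uses the trapezoidal rule, the half-integral scale is the first sampled scale at which the running integral reaches half of the total (which presumes a non-negative curve), and the monotonicity index is returned as infinity when the curve never increases. All function names are hypothetical.

```python
import math


def standard_deviation(values):
    """Population standard deviation of the sampled signature."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def cumulative_integral(scales, values):
    """Running trapezoidal integral of the curve, one entry per scale."""
    total, running = 0.0, [0.0]
    for k in range(1, len(scales)):
        total += 0.5 * (values[k] + values[k - 1]) * (scales[k] - scales[k - 1])
        running.append(total)
    return running


def half_integral_scale(scales, values):
    """First scale at which the running integral reaches half the total."""
    running = cumulative_integral(scales, values)
    half = 0.5 * running[-1]
    for scale, partial in zip(scales, running):
        if partial >= half:
            return scale
    return scales[-1]


def monotonicity_index(values):
    """i = (i_s + i_d + i_p) / i_s, counting rises, falls and plateaus."""
    steps = list(zip(values, values[1:]))
    i_s = sum(1 for a, b in steps if b > a)   # curve increases
    i_d = sum(1 for a, b in steps if b < a)   # curve decreases
    i_p = sum(1 for a, b in steps if b == a)  # curve stays constant
    if i_s == 0:
        return math.inf  # assumption: the index is undefined without a rise
    return (i_s + i_d + i_p) / i_s
```

The total integral is the last entry of `cumulative_integral`. Applying three such measures to each of the four signatures yields the twelve features that enter the discriminant analysis.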

V. CONCLUSION 

In this paper we have assessed the potential of the
multi-scale Minkowski functionals for protein morpholog- 
ical characterization and structural analysis. We found 
that these functionals are potentially suited to this kind 
of analysis, as substantiated by the results obtained for 
a distinct set of structures known to have highly similar 
topological features. For all but one family of structures, 
namely the glycosyl hydrolase family 22, the classification 
through a classical discriminant analysis yielded fully ac- 
curate results. These results are comparable with the 
best approach so far Q , which uses considerably more pa- 
rameters and is based on a complex concept. In addition 
to the classification result, it is important to emphasize 
the simplicity of the algorithm and the clear relationship 
between the quantities used for the characterization and 
familiar geometrical, topological and biological concepts. 
This direct relation to familiar measurements, combined 
with the simplicity for implementing the MIA approach, 
suggests that this kind of analysis is a particularly useful 
tool for classifying the shape of protein structures. 